Dataset summary: Dengue - grouped by patient
Report generated using dataprep.
Dataset Statistics
| Number of Variables | 14 |
|---|---|
| Number of Rows | 14484 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 32 |
| Duplicate Rows (%) | 0.2% |
| Total Size in Memory | 3.3 MB |
| Average Row Size in Memory | 235.4 B |
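Overview figures of this kind can be approximated directly with pandas. The sketch below uses a small made-up frame, not the dengue data, and its column values are illustrative only. dataprep's own computation may differ in detail, for instance in how memory is measured.

```python
import pandas as pd

# Toy frame standing in for the grouped data (all values are made up)
toy = pd.DataFrame({
    "age": [5, 8, 8, 11],
    "gender": ["Male", "Female", "Female", "Male"],
    "plt": [120.0, 169.0, 169.0, 243.0],
})

n_rows, n_cols = toy.shape
missing_cells = int(toy.isna().sum().sum())
# duplicated() flags rows that repeat an earlier row exactly
duplicate_rows = int(toy.duplicated().sum())
# deep=True also counts the Python strings held in object columns
total_bytes = int(toy.memory_usage(deep=True).sum())

summary = {
    "Number of Variables": n_cols,
    "Number of Rows": n_rows,
    "Missing Cells": missing_cells,
    "Duplicate Rows": duplicate_rows,
    "Duplicate Rows (%)": round(100 * duplicate_rows / n_rows, 1),
    "Average Row Size in Memory (B)": round(total_bytes / n_rows, 1),
}
print(summary)
```

Applied to the report above, the same percentage formula gives 32 / 14484, which rounds to the 0.2% shown for duplicate rows.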
Variable Types
| Categorical | 9 |
|---|---|
| Numerical | 5 |
dsource
categorical
| Distinct Count | 10 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.7 MB |
Length
| Mean | 3.1826 |
|---|---|
| Standard Deviation | 0.9882 |
| Median | 4 |
| Minimum | 2 |
| Maximum | 5 |
Sample
| 1st row | 01nva |
|---|---|
| 2nd row | 01nva |
| 3rd row | 01nva |
| 4th row | 01nva |
| 5th row | 01nva |
Letter
| Count | 28967 |
|---|---|
| Lowercase Letter | 28967 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 17130 |
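The Length and Letter tables describe the string values character by character. In the tables above, the 28967 letters and 17130 digits add up to 46097 characters, which agrees with a mean length of 3.1826 over 14484 rows. A minimal sketch of how such counts could be produced with the standard library follows; apart from `01nva`, which appears in the sample, the codes are invented, and this is not necessarily how dataprep computes them.

```python
import statistics
import unicodedata
from collections import Counter

# "01nva" appears in the report's sample; the other codes are made up
values = ["01nva", "01nva", "md", "fl", "df"]

lengths = [len(v) for v in values]
length_stats = {
    "Mean": statistics.mean(lengths),
    "Median": statistics.median(lengths),
    "Minimum": min(lengths),
    "Maximum": max(lengths),
}

# Unicode general categories: Ll = lowercase letter, Lu = uppercase letter,
# Nd = decimal number, Zs = space separator, Pd = dash punctuation
categories = Counter(unicodedata.category(ch) for v in values for ch in v)
lowercase = categories["Ll"]
digits = categories["Nd"]

print(length_stats, dict(categories))
```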
age
numerical
| Distinct Count | 27 |
|---|---|
| Unique (%) | 0.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.0 MB |
| Mean | 8.2995 |
| Minimum | 0 |
| Maximum | 18 |
| Zeros | 4 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 2 |
| Q1 | 5 |
| Median | 8 |
| Q3 | 11 |
| 95-th Percentile | 14 |
| Maximum | 18 |
| Range | 18 |
| IQR | 6 |
Descriptive Statistics
| Mean | 8.2995 |
|---|---|
| Standard Deviation | 3.9859 |
| Variance | 15.8876 |
| Sum | 120210.5 |
| Skewness | -0.02044 |
| Kurtosis | -0.8352 |
| Coefficient of Variation | 0.4803 |
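Several rows in the numerical tables are derived from others: the range is maximum minus minimum, the IQR is Q3 minus Q1, the variance is the squared standard deviation, and the coefficient of variation is the standard deviation divided by the mean. The sketch below recomputes these for `age` from the reported values as a consistency check; the tolerance is an arbitrary choice that allows for rounding in the report.

```python
import math

n_rows = 14484  # Number of Rows from the dataset statistics

# Values copied from the age tables above
reported = {
    "mean": 8.2995, "std": 3.9859, "variance": 15.8876, "sum": 120210.5,
    "min": 0, "max": 18, "q1": 5, "q3": 11,
    "range": 18, "iqr": 6, "cv": 0.4803,
}

derived = {
    "mean": reported["sum"] / n_rows,
    "variance": reported["std"] ** 2,
    "range": reported["max"] - reported["min"],
    "iqr": reported["q3"] - reported["q1"],
    "cv": reported["std"] / reported["mean"],
}

# True where the recomputed value matches the reported one within rounding
consistent = {
    name: math.isclose(value, reported[name], rel_tol=1e-3)
    for name, value in derived.items()
}
print(consistent)
```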
gender
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8737 |
|---|---|
| Standard Deviation | 0.992 |
| Median | 4 |
| Minimum | 4 |
| Maximum | 6 |
Sample
| 1st row | Male |
|---|---|
| 2nd row | Female |
| 3rd row | Female |
| 4th row | Male |
| 5th row | Female |
Letter
| Count | 70590 |
|---|---|
| Lowercase Letter | 56106 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
weight
numerical
| Distinct Count | 353 |
|---|---|
| Unique (%) | 2.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.0 MB |
| Mean | 28.4854 |
| Minimum | 7.2 |
| Maximum | 114 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 7.2 |
|---|---|
| 5-th Percentile | 12 |
| Q1 | 19 |
| Median | 26 |
| Q3 | 37 |
| 95-th Percentile | 52 |
| Maximum | 114 |
| Range | 106.8 |
| IQR | 18 |
Descriptive Statistics
| Mean | 28.4854 |
|---|---|
| Standard Deviation | 12.8 |
| Variance | 163.8408 |
| Sum | 412582.6 |
| Skewness | 0.8739 |
| Kurtosis | 0.8789 |
| Coefficient of Variation | 0.4494 |
bleeding
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.7429 |
|---|---|
| Standard Deviation | 0.4371 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 68696 |
|---|---|
| Lowercase Letter | 54212 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
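For a two-valued column whose values have different lengths, the class counts can be recovered from the Letter count alone. The sketch below assumes that `bleeding` contains only the strings `True` and `False`, and `gender` only `Male` and `Female`, which is consistent with the distinct counts, length ranges and samples shown but is not stated by the report. The helper name `split_counts` is introduced here for illustration.

```python
def split_counts(total_chars, n_rows, short_len, long_len):
    """Solve short_len*a + long_len*b = total_chars with a + b = n_rows.

    Returns (a, b): the number of short and long values.
    """
    extra = total_chars - short_len * n_rows
    long_count, remainder = divmod(extra, long_len - short_len)
    if remainder:
        raise ValueError("counts are not consistent with two fixed lengths")
    return n_rows - long_count, long_count


n_rows = 14484

# bleeding: "True" has 4 characters, "False" has 5; Letter count is 68696
true_count, false_count = split_counts(68696, n_rows, 4, 5)

# gender: "Male" has 4 characters, "Female" has 6; Letter count is 70590
male_count, female_count = split_counts(70590, n_rows, 4, 6)

print(true_count, false_count, round(true_count / n_rows, 4))
print(male_count, female_count)
```

Under these assumptions, roughly a quarter of the patients have `bleeding` set to `True`, which matches the mean length of 4.7429 (a mean of 5 would mean all `False`).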
plt
numerical
| Distinct Count | 1144 |
|---|---|
| Unique (%) | 7.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.0 MB |
| Mean | 167.1835 |
| Minimum | 3 |
| Maximum | 829 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 3 |
|---|---|
| 5-th Percentile | 24 |
| Q1 | 71 |
| Median | 169 |
| Q3 | 243 |
| 95-th Percentile | 338 |
| Maximum | 829 |
| Range | 826 |
| IQR | 172 |
Descriptive Statistics
| Mean | 167.1835 |
|---|---|
| Standard Deviation | 104.5549 |
| Variance | 10931.7272 |
| Sum | 2.4215e+06 |
| Skewness | 0.4243 |
| Kurtosis | -0.1921 |
| Coefficient of Variation | 0.6254 |
shock
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.9516 |
|---|---|
| Standard Deviation | 0.2146 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | True |
| 3rd row | True |
| 4th row | True |
| 5th row | True |
Letter
| Count | 71719 |
|---|---|
| Lowercase Letter | 57235 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
haematocrit_percent
numerical
| Distinct Count | 560 |
|---|---|
| Unique (%) | 3.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.0 MB |
| Mean | 41.3015 |
| Minimum | 21 |
| Maximum | 67.05 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 21 |
|---|---|
| 5-th Percentile | 33.5 |
| Q1 | 37.2 |
| Median | 40.3 |
| Q3 | 45 |
| 95-th Percentile | 52 |
| Maximum | 67.05 |
| Range | 46.05 |
| IQR | 7.8 |
Descriptive Statistics
| Mean | 41.3015 |
|---|---|
| Standard Deviation | 5.6372 |
| Variance | 31.7783 |
| Sum | 598210.2593 |
| Skewness | 0.6312 |
| Kurtosis | 0.1445 |
| Coefficient of Variation | 0.1365 |
bleeding_gum
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8903 |
|---|---|
| Standard Deviation | 0.3125 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 70831 |
|---|---|
| Lowercase Letter | 56347 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
abdominal_pain
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.682 |
|---|---|
| Standard Deviation | 0.4657 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | True |
| 3rd row | True |
| 4th row | True |
| 5th row | True |
Letter
| Count | 67814 |
|---|---|
| Lowercase Letter | 53330 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
ascites
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8391 |
|---|---|
| Standard Deviation | 0.3675 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | False |
| 4th row | False |
| 5th row | False |
Letter
| Count | 70089 |
|---|---|
| Lowercase Letter | 55605 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
bleeding_mucosal
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8159 |
|---|---|
| Standard Deviation | 0.3876 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 69754 |
|---|---|
| Lowercase Letter | 55270 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
bleeding_skin
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.5429 |
|---|---|
| Standard Deviation | 0.4982 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 65800 |
|---|---|
| Lowercase Letter | 51316 |
| Space Separator | 0 |
| Uppercase Letter | 14484 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
body_temperature
numerical
| Distinct Count | 1218 |
|---|---|
| Unique (%) | 8.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.0 MB |
| Mean | 37.8323 |
| Minimum | 35 |
| Maximum | 41.5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 35 |
|---|---|
| 5-th Percentile | 37 |
| Q1 | 37.2 |
| Median | 37.58 |
| Q3 | 38.3333 |
| 95-th Percentile | 39.5 |
| Maximum | 41.5 |
| Range | 6.5 |
| IQR | 1.1333 |
Descriptive Statistics
| Mean | 37.8323 |
|---|---|
| Standard Deviation | 0.8257 |
| Variance | 0.6818 |
| Sum | 547963.5822 |
| Skewness | 0.9208 |
| Kurtosis | 0.2582 |
| Coefficient of Variation | 0.02183 |
import pandas as pd
from dataprep.eda import create_report

from pkgname.utils.data_loader import load_dengue, IQR_rule
from pkgname.utils.print_utils import suppress_stdout, suppress_stderr

# Columns summarised in the report
features = ["dsource", "age", "gender", "weight", "bleeding", "plt",
            "shock", "haematocrit_percent", "bleeding_gum", "abdominal_pain",
            "ascites", "bleeding_mucosal", "bleeding_skin", "body_temperature"]

# Load quietly. The two context managers are joined with a comma:
# "A() and B()" evaluates to B() only, so just one would be entered.
with suppress_stdout(), suppress_stderr():
    df = load_dengue(usecols=['study_no'] + features)

# Fill gaps: forward fill within each patient, then back fill.
# The back fill runs on the whole column, not per patient.
for feat in features:
    df[feat] = df.groupby('study_no')[feat].ffill().bfill()

# Keep patients aged 18 or under and drop rows that are still incomplete
df = df.loc[df['age'] <= 18]
df = df.dropna()

# Collapse to one row per patient ("mean" replaces np.mean, which newer
# pandas versions deprecate as an aggfunc; the result is the same)
df = df.groupby(by="study_no", dropna=False).agg(
    dsource=pd.NamedAgg(column="dsource", aggfunc="last"),
    age=pd.NamedAgg(column="age", aggfunc="max"),
    gender=pd.NamedAgg(column="gender", aggfunc="first"),
    weight=pd.NamedAgg(column="weight", aggfunc="mean"),
    bleeding=pd.NamedAgg(column="bleeding", aggfunc="max"),
    plt=pd.NamedAgg(column="plt", aggfunc="min"),
    shock=pd.NamedAgg(column="shock", aggfunc="max"),
    haematocrit_percent=pd.NamedAgg(column="haematocrit_percent", aggfunc="max"),
    bleeding_gum=pd.NamedAgg(column="bleeding_gum", aggfunc="max"),
    abdominal_pain=pd.NamedAgg(column="abdominal_pain", aggfunc="max"),
    ascites=pd.NamedAgg(column="ascites", aggfunc="max"),
    bleeding_mucosal=pd.NamedAgg(column="bleeding_mucosal", aggfunc="max"),
    bleeding_skin=pd.NamedAgg(column="bleeding_skin", aggfunc="max"),
    body_temperature=pd.NamedAgg(column="body_temperature", aggfunc="mean"),
).dropna()

# Apply the project's IQR_rule helper to plt (by its name, an
# interquartile-range outlier rule)
df = IQR_rule(df, ['plt'])

report = create_report(df, title="Dengue dataset report")
report
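The script calls `IQR_rule(df, ['plt'])` without showing what the helper does. As a rough illustration only, the sketch below implements the conventional interquartile-range filter with Tukey's fences. The function name `iqr_filter`, the multiplier `k=1.5` and the toy data are assumptions, and the project's helper may behave differently: the reported `plt` maximum of 829 lies above 243 + 1.5 × 172 = 501, the upper fence computed from the report's own quartiles.

```python
import pandas as pd


def iqr_filter(df, columns, k=1.5):
    """Hypothetical stand-in for IQR_rule: keep rows inside Tukey's fences.

    A row is kept only if, for every listed column, its value lies in
    [Q1 - k*IQR, Q3 + k*IQR], with quartiles computed on the input frame.
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[keep]


# Made-up platelet counts with one extreme value
toy = pd.DataFrame({"plt": [70.0, 150.0, 170.0, 240.0, 5000.0]})
filtered = iqr_filter(toy, ["plt"])
print(filtered)
```

Because the quartiles are taken before filtering, statistics reported on the filtered data will generally have different quartiles from those used to draw the fences.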
Total running time of the script: (0 minutes 5.308 seconds)